[AMD] feat: MiniMax M3 day-zero benchmark for MI325X #1748

Open
cquil11 wants to merge 1 commit into main from codex/minimaxm3-mi325x-dayzero

cquil11 commented Jun 13, 2026 (Collaborator)

Summary

  • add minimaxm3-fp8-mi325x-vllm for MiniMax M3 MXFP8 on MI325X
  • use vllm/vllm-openai-rocm:minimax-m3 and the official MI325X MXFP8 recipe shape
  • mirror the H200 non-MTP search space: TP4/TP8 latency, TP4/TP8 expert parallelism, and TP8 data-parallel attention across 1k1k and 8k1k
  • route Hugging Face cache to node-local /local-nvme/hf-hub-cache/ and runtime compiler caches to container-local /tmp
  • disable prefix caching for random-dataset benchmarks
  • mount /dev/kfd and /dev/dri explicitly for ROCm
  • use the default BF16 KV cache because FP8 KV corrupts MiniMax M3 generation on this MI325X/gfx942 image

Recipe Alignment

  • model: MiniMaxAI/MiniMax-M3-MXFP8
  • image: vllm/vllm-openai-rocm:minimax-m3
  • --block-size 128
  • --attention-backend TRITON_ATTN
  • --language-model-only
  • --no-enable-prefix-caching
  • MiniMax M3 tool/reasoning parsers with automatic tool choice
  • no MI355X-specific --enforce-eager workaround

Upstream reference: https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3?hardware=mi325x&variant=mxfp8
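
The recipe flags listed above can be assembled into a single command line. The following is a minimal dry-run sketch, not the PR's actual minimaxm3_fp8_mi325x.sh: the TP variable and its default are assumptions made for illustration, and the command is only printed rather than executed. Note that --kv-cache-dtype is deliberately absent, so vLLM keeps its default KV cache dtype.

```shell
# Dry-run sketch: assemble the serve command from the recipe flags and print it.
MODEL="MiniMaxAI/MiniMax-M3-MXFP8"
TP="${TP:-4}"   # hypothetical variable; the search space covers TP4 and TP8

# Build the argument list as positional parameters so quoting stays intact.
set -- vllm serve "$MODEL" \
  --tensor-parallel-size "$TP" \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --language-model-only \
  --no-enable-prefix-caching \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

SERVE_CMD="$*"
echo "$SERVE_CMD"

# To launch for real on a node with the ROCm image, one would run:
#   "$@" > "$SERVER_LOG" 2>&1 &
```

Keeping the arguments in the positional parameters means the same list can be printed for logging and executed unchanged.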

Validation

Representative throughput smoke: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27482912444

  • vLLM started successfully on MI325X with the PR command
  • model downloaded through node-local /local-nvme/hf-hub-cache
  • all checkpoint shards loaded; CUDA graph capture completed
  • 40 random 1k1k requests at TP4 / EP1 / concurrency 4 completed per runner
  • result processing, result upload, server-log upload, GPU-metrics upload, aggregation, and success-rate calculation succeeded

Targeted DPA accuracy validation: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27484953170

  • exact failed full-sweep point: TP1 x DP8 + EP, concurrency 512, 8k1k eval-only
  • server startup and all 1,319 GSM8K requests completed
  • GSM8K strict exact match: 0.9575
  • GSM8K flexible exact match: 0.9568
  • score validation, server-log upload, GPU-metrics upload, eval artifact upload, aggregation, and success-rate calculation succeeded

Failure diagnosis: --kv-cache-dtype fp8 produced deterministic, repetitive, cross-prompt generation corruption and a GSM8K score of only 1-2%. On the same node, image, weights, and layouts, removing only the FP8 KV setting restored correct generation both with and without CUDA graphs. The PR therefore leaves the KV cache at vLLM's default dtype.
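
The gap between the corrupted run (1-2% GSM8K) and the healthy run (0.9575 strict exact match) is wide enough that a simple threshold separates them. The following is a minimal sketch of such an accuracy gate. The function name score_ok and the 0.90 threshold are assumptions made for illustration, not the repository's actual score validation.

```shell
# Sketch of an accuracy gate; score_ok and MIN_SCORE are hypothetical names.
score_ok() {
  # $1 = measured score, $2 = minimum acceptable score.
  # awk does the floating-point comparison; exit status 0 means "pass".
  awk -v s="$1" -v min="$2" 'BEGIN { exit !(s + 0 >= min + 0) }'
}

MIN_SCORE="0.90"        # assumed threshold, not taken from the PR
GSM8K_STRICT="0.9575"   # strict exact match reported in the validation run

if score_ok "$GSM8K_STRICT" "$MIN_SCORE"; then
  echo "accuracy gate passed"
else
  echo "accuracy gate failed"
fi
```

A score in the 1-2% range, as seen with FP8 KV cache, would fail the same check.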

Additional validation:

  • shell syntax, YAML parsing, matrix generation, and git diff --check pass
  • matrix matches the H200-aligned 31-point search space
  • /enroot resolves to local NVMe on every healthy compute node
  • XDG_CACHE_HOME and TRITON_CACHE_DIR use per-job local paths, avoiding stale NFS compiler artifacts
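
The cache routing described above can be sketched as a few exports. This is an illustration under stated assumptions, not the contents of launch_mi325x-amds.sh: the use of HF_HUB_CACHE as the variable that carries the Hugging Face path, the JOB_TAG helper, and the directory names under /tmp are all hypothetical.

```shell
# Sketch: node-local model cache plus per-job compiler caches.
# JOB_TAG falls back to the shell PID when not running under Slurm.
JOB_TAG="${SLURM_JOB_ID:-$$}"

# Model weights: node-local NVMe path from the PR description.
# Assumption: the launcher exposes it through HF_HUB_CACHE.
export HF_HUB_CACHE="/local-nvme/hf-hub-cache/"

# Compiler caches: per-job paths under /tmp so stale artifacts are never reused.
export XDG_CACHE_HOME="/tmp/xdg-cache-${JOB_TAG}"
export TRITON_CACHE_DIR="/tmp/triton-cache-${JOB_TAG}"
mkdir -p "$XDG_CACHE_HOME" "$TRITON_CACHE_DIR"

echo "HF hub cache:  $HF_HUB_CACHE"
echo "XDG cache:     $XDG_CACHE_HOME"
echo "Triton cache:  $TRITON_CACHE_DIR"
```

Keying the compiler caches on the job ID trades some recompilation time for isolation from artifacts left by earlier jobs.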

Full PR sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27485135330

Changelog Integrity

perf-changelog.yaml matches current main byte-for-byte, followed only by this PR's entry appended at the tail.


Note

Low Risk
Benchmark and CI runner configuration only; MI325X launcher changes affect cache paths and GPU device visibility but are scoped to the AMD Slurm launch path.

Overview
Adds MiniMax-M3 MXFP8 single-node vLLM benchmarking on MI325X via a new minimaxm3-fp8-mi325x-vllm matrix entry, a minimaxm3_fp8_mi325x.sh runner aligned to the official MI325X recipe (vllm/vllm-openai-rocm:minimax-m3, block size 128, TRITON_ATTN, MiniMax parsers, default BF16 KV), and an H200-style search space (TP4/TP8, EP, TP8 DPA) for 1k1k and 8k1k.

launch_mi325x-amds.sh is updated for all MI325X jobs: the Hugging Face hub cache moves from NFS to /local-nvme/hf-hub-cache/, XDG_CACHE_HOME and TRITON_CACHE_DIR become per-job paths under /tmp, and /dev/kfd and /dev/dri are mounted explicitly for ROCm in the container.

Reviewed by Cursor Bugbot for commit 6abc71f.

cquil11 marked this pull request as ready for review June 13, 2026 23:48
cquil11 requested a review from a team June 13, 2026 23:48

@github-actions (Contributor)

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first, before we merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

cquil11 marked this pull request as draft June 13, 2026 23:48

claude Bot commented Jun 13, 2026 (Contributor)

Claude finished @cquil11's task in 2m 46s


Review: PR #1748 — MiniMax M3 MI325X day-zero benchmark

  • Read changed files and gather context
  • Validate master config and perf-changelog
  • Validate benchmark script code style and patterns
  • Validate launch script changes
  • Post review findings

LGTM — no blocking issues found.

Validation summary:

  • Config + perf-changelog both updated ✅
  • Perf-changelog entry appended at end of file ✅
  • Image vllm/vllm-openai-rocm:minimax-m3 is publicly accessible (Docker Hub) and matches MI355X ✅
  • Launch script has enroot import docker:// pattern for reproducibility ✅
  • Expert parallelism is correctly conditional on EP_SIZE / DP_ATTENTION
  • Server launch command is properly formatted with separate lines ✅
  • MI325X search space is an exact match of the H200 non-MTP search space (verified both 1k1k and 8k1k) ✅
  • Benchmark script is functionally identical to the MI355X variant (minimaxm3_fp8_mi355x.sh), differing only in the header comment ✅

Note (non-blocking): The launch_mi325x-amds.sh changes (HF cache path moved to /local-nvme/ and /dev/kfd + /dev/dri mounts) apply to all MI325X benchmarks, not just MiniMax M3. The device mounts are additive, so they shouldn't break existing jobs, and moving to node-local NVMe is a reasonable infra choice for faster model loading. This note only flags the scope.

cquil11 marked this pull request as ready for review June 13, 2026 23:49
cquil11 marked this pull request as draft June 13, 2026 23:50
cquil11 marked this pull request as ready for review June 13, 2026 23:51

claude Bot commented Jun 13, 2026 (Contributor)

Claude finished @cquil11's task in 1m 10s


Review: PR #1748 — MiniMax M3 MI325X day-zero benchmark

  • Read changed files and gather context
  • Validate master config and perf-changelog
  • Validate container image accessibility
  • Validate benchmark script (code style, EP pattern, server formatting)
  • Validate launch script (enroot import, device mounts)
  • Validate model prefix
  • Post review findings

LGTM — no blocking issues found.

Validation summary:

  • Master config + perf-changelog: Both updated, perf-changelog entry correctly appended at EOF ✅
  • Model prefix: minimaxm3 matches existing entries across AMD and NVIDIA configs ✅
  • Container image: vllm/vllm-openai-rocm:minimax-m3 on Docker Hub — publicly accessible ✅
  • Enroot import: launch_mi325x-amds.sh has the enroot import docker:// pattern ✅
  • Expert parallelism: Correctly conditional on EP_SIZE / DP_ATTENTION with the standard pattern ✅
  • Server command formatting: Arguments on separate lines with \ continuations ✅
  • H200 search space alignment: MI325X search space is an exact match of the H200 non-MTP search space (both 1k1k and 8k1k) ✅
  • MI355X vs MI325X diff: Only differences are the header comment and removal of --enforce-eager (intentional per PR description — MI355X-specific workaround) ✅

Note (non-blocking): The launch_mi325x-amds.sh changes (HF cache path moved to /local-nvme/ and /dev/kfd + /dev/dri mounts) apply to all MI325X benchmarks, not just MiniMax M3. The device mounts are additive and the NVMe cache path is a reasonable infra choice. This note only flags the blast radius.
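
The expert-parallelism check in the review above refers to flags that depend on EP_SIZE and DP_ATTENTION. The following is a minimal sketch of what such a conditional might look like, not the benchmark script's actual code. The function name build_parallel_args and its argument order are hypothetical, and the mapping of a TP8 data-parallel-attention point to TP1 x DP8 + EP is inferred from the validation notes in the PR description.

```shell
# Sketch: derive parallelism flags from TP, EP_SIZE, and DP_ATTENTION.
# build_parallel_args is a hypothetical helper, not part of the PR.
build_parallel_args() {
  tp="$1"; ep_size="$2"; dp_attention="$3"

  if [ "$dp_attention" = "true" ]; then
    # Data-parallel attention: one TP rank per replica, one replica per GPU.
    args="--tensor-parallel-size 1 --data-parallel-size $tp"
  else
    args="--tensor-parallel-size $tp"
  fi

  # Expert parallelism is enabled only when requested or implied by DP attention.
  if [ "$ep_size" -gt 1 ] || [ "$dp_attention" = "true" ]; then
    args="$args --enable-expert-parallel"
  fi

  printf '%s\n' "$args"
}

build_parallel_args 4 1 false   # plain TP4 latency point
build_parallel_args 8 8 false   # TP8 with expert parallelism
build_parallel_args 8 1 true    # TP8-sized node run as TP1 x DP8 + EP
```

Returning the flags as a string keeps the sketch portable across POSIX shells. A real launcher might prefer an array to preserve quoting.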

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5ec3e11.

--no-enable-prefix-caching \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice > "$SERVER_LOG" 2>&1 &

Missing FP8 KV cache flag

Medium Severity

The new MI325X vllm serve invocation omits --kv-cache-dtype fp8 even though the PR recipe alignment, changelog, and the existing minimaxm3_fp8_mi355x.sh baseline all specify FP8 KV cache. Without it, vLLM may use a non-FP8 KV default, skewing memory headroom and throughput versus the official MI325X MXFP8 recipe and other MiniMax M3 entries.

Rebased onto main: MiniMax-M3 MXFP8 MI325X day-zero recipe (script +
amd-master entry + perf-changelog + mi325x launcher tuning), plus
VLLM_USE_BREAKABLE_CUDAGRAPH=0 so the recipe runs with CUDA graphs.
Consolidated the branch's commits onto current main (which now carries
the mi300x non-MTP/MTP recipes) to resolve the amd-master/changelog
EOF-append conflicts.

Co-Authored-By: functionstackx <47992694+functionstackx@users.noreply.github.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
functionstackx force-pushed the codex/minimaxm3-mi325x-dayzero branch from f78392f to 6abc71f June 14, 2026 07:24